



#### Matrix Transpose 1



- Consider an n×n matrix where 32 divides n.
- We focus on the device code:
  - the host code performs typical tasks: data allocation and transfer between host and device, the launching and timing of several kernels, result validation, and the deallocation of host and device memory.
- · Benchmarks illustrate this section:
  - we compare our matrix transpose kernels against a matrix copy kernel,
  - for each kernel, we compute the effective bandwidth, calculated in GB/s as twice the size of the matrix (once for reading the matrix and once for writing) divided by the time of execution





```
Transpose 'C' code: CPU version

O(n²)

void transpose_CPU(float in[], float out[])
{
    for(int j=0; j < N; j++)
        for(int i=0; i < N; i++)
            out[j + i*N] = in[i + j*N]; // out(j,i) = in(i,j)
}

https://github.com/udacity/cs344/blob/master/Lesson%20Code%20Snippets/Lesson%205%20Code%20Snippets/transpose.cu
```

## Transpose 'C' code: GPU version 1

## $O(n^2)$

```
// to be launched on a single thread
// transpose_serial < < < 1,1>>> (d_in, d_out);
__global__ void transpose_serial(float in[], float out[])
{
    for(int j=0; j < N; j++)
        for(int i=0; i < N; i++)
        out[j + i*N] = in[i + j*N]; // out(j,i) = in(i,j)
}</pre>
```

 $\frac{https://github.com/udacity/cs344/blob/master/Lesson\%20Code\%20Snippets/Lesson\%20Code\%20Snippets/transpose.cu}{n\%205\%20Code\%20Snippets/transpose.cu}$ 

## Transpose 'C' code: GPU version 2

```
// to be launched on a single thread
// transpose_parallel_per_row<<<1,N>>>(d_in, d_out);
__global__ void
transpose_parallel_per_row(float in[], float out[])
{
    int i = threadIdx.x;
    for(int j=0; j < N; j++)
        out[j + i*N] = in[i + j*N]; // out(j,i) = in(i,j)
}</pre>
```

https://github.com/udacity/cs344/blob/master/Lesson%20Code%20Snippets/Lesson%20Code%20Snippets/transpose.cu

#### Transpose 'C' code: GPU version 3

```
// dim3 blocks(N/K,N/K); // blocks per grid
// dim3 threads(K,K); // threads per block
// transpose_parallel_per_element<<<br/>blocks,threads>>>(d_in, d_out);
__global__ void
transpose_parallel_per_element(float in[], float out[])
{
    int i = blockIdx.x * K + threadIdx.x;
    int j = blockIdx.y * K + threadIdx.y;
    out[j + i*N] = in[i + j*N]; // out(j,i) = in(i,j)
}
```

 $\frac{https://github.com/udacity/cs344/blob/master/Lesson\%20Code\%20Snippets/Lesson\%20Code\%20Snippets/transpose.cu}{n\%205\%20Code\%20Snippets/transpose.cu}$ 

## Deep Dive: Performance-Aware Opt.

```
// dim3 blocks(N/K,N/K); // blocks per grid
// dim3 threads(K,K); // threads per block
// transpose parallel per element<<<br/>blocks.threads>>>(d in, d out);
_global__void
transpose parallel per_element(float in[], float out[])
{
    int i = blockldx.x * K + threadldx.x;
    int j = blockldx.y * K + threadldx.y;
    out[j + j*N] = in[i + j*N]; // out(j,j) = in(i,j)
}
```

N = Height or width (square matrix: height == width)
K = size of a thread block in x-dimension or y-dimension









## A simple copy kernel 1 — for Comparison \_\_global\_\_ void copy(float \*odata, float\* idata) { int x = blockldx.x\*TILE\_DIM + threadIdx.x; int y = blockldx.y\*TILE\_DIM + threadIdx.y; int width = gridDim.x \* TILE\_DIM; for (int j=0; j<TILE\_DIM; j+=BLOCK\_ROWS) { odata[(y+j)\*width + x] = idata[(y+j)\*width + x]; } } BLOCK\_ROWS j = 0 >> odata[y\*width + x] = idata[y\*width + x]; j = 1 >>odata[(y+1)\*width + x] = idata[(y+1)\*width + x]; https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/

#### A simple copy kernel 2 – for Comparison

- odata and idata are pointers to the input and output matrices,
- width = height = gridDim.x\*TILE\_DIM
- In this kernel, xindex and yindex are global 2D matrix indices and they used to calculate index, the 1D index used to access matrix elements.

```
__global__ void copy(float *odata, float* idata)
{
    int x = blockldx.x*TILE_DIM + threadIdx.x;
    int y = blockldx.y*TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    for (int j=0; j<TILE_DIM; j+=BLOCK_ROWS) {
        odata[(y+j)*width + x] = idata[(y+j)*width + x];
    }
}
```



## Naive transpose kernel vs copy kernel

• The performance of these two kernels on a 1024x1024 matrix using a Tesla GPUs is given in the following table:

|                | Effective Bandwidth (GB/s, ECC enabled) |            |
|----------------|-----------------------------------------|------------|
| Routine        | Tesla M2050                             | Tesla K20c |
| сору           | 105.2                                   | 136.0      |
| transposeNaive | 18.8                                    | 55.3       |



https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/

```
Problem?

__global__ void transposeNaive(float *odata, float* idata)
{
    int x = blockldx.x*TILE_DIM + threadIdx.x;
    int y = blockldx.y*TILE_DIM + threadIdx.y;
    int width = gridDim.x*TILE_DIM;

for (int j=0; j<TILE_DIM; j+=BLOCK_ROWS) {
    odata[x*width + (y+j)] = idata[(y+j)*width + x];
}

(xindex, yindex): (0, 0) \rightarrow (1, 0) \rightarrow (2, 0) \rightarrow (3, 0) \rightarrow \cdots (31, 0)
\rightarrow (0, 1) \rightarrow (1, 1) \rightarrow (2, 1) \rightarrow (3, 1) \rightarrow \cdots (31, 1)
\rightarrow \cdots
```

```
Problem?

__global__ void transposeNaive(float *odata, float* idata) {
    int x = blockldx.x*TILE_DIM + threadIdx.x;
    int y = blockldx.y*TILE_DIM + threadIdx.y;
    int width = gridDim.x*TILE_DIM;

for (int j=0; j<TILE_DIM; j+=BLOCK_ROWS) {
    odata[x*width + (y+j)] = idata[(y+j)*width + x];
}

(xindex, yindex): (0,0) \rightarrow (1,0) \rightarrow (2,0) \rightarrow (3,0) \rightarrow \cdots (31,0)
\rightarrow (0,1) \rightarrow (1,1) \rightarrow (2,1) \rightarrow (3,1) \rightarrow \cdots (31,1)
\rightarrow \cdots

One Transaction .vs. 32 Transactions!
```

 Because device memory [GPU DDR Memory] has a much higher latency and lower bandwidth than on-chip memory [shared memory], special attention must be paid to: how global memory accesses are performed?



## Coalesced Transpose 2

- The simultaneous global memory accesses by each thread of a during the execution of a single read or write instruction will be coalesced into a single access if:
  - The size of the memory element accessed by each thread is either 4, 8, or 16 bytes.
  - The elements form a contiguous block of memory.
  - The i-th element is accessed by the i-th thread in the warp.





- The simultaneous global memory accesses by each thread of a during the execution of a single read or write instruction will be coalesced into a single access if:
  - The size of the memory element accessed by each thread is either 4, 8, or 16 bytes.
  - The elements form a contiguous block of memory.
  - The i-th element is accessed by the i-th thread in the warp.
- Last two requirements can be relaxed (compiler optimization) with compute capabilities of 1.2.
- Coalescing happens even if some threads do not access memory (divergent warp)





- Allocating device memory through cudaMalloc() and choosing TILE\_DIM to be a multiple of 16 ensures alignment with a segment of memory, therefore all loads from idata are coalesced.
- Coalescing behavior differs between the simple copy and naïve transpose kernels when writing to odata.



- The way to avoid uncoalesced global memory access is
  - to read the data into shared memory and,
  - have each warp access noncontiguous locations in shared memory in order to write contiguous data to odata.
- There is no performance penalty for noncontiguous access patterns in shared memory as there is in global memory.
- a \_\_synchthreads() call is required to ensure that all reads from idata to shared memory have completed before writes from shared memory to odata.

# Coalesced Transpose 8 \_\_global\_\_ void transposeCoalesced(float \*odata, const float \*idata) { \_\_shared\_\_ float tile[TILE\_DIM][TILE\_DIM]; int x = blockldx.x \* TILE\_DIM + threadIdx.x; int y = blockldx.y \* TILE\_DIM + threadIdx.y; int width = gridDim.x \* TILE\_DIM; for (int j = 0; j < TILE\_DIM; j += BLOCK\_ROWS) tile[threadIdx.y+j][threadIdx.x] = idata[(y+j)\*width + x]; \_\_syncthreads(); x = blockldx.y \* TILE\_DIM + threadIdx.x; // transpose block offset y = blockldx.x \* TILE\_DIM + threadIdx.y; for (int j = 0; j < TILE\_DIM; j += BLOCK\_ROWS) odata[(y+j)\*width + x] = tile[threadIdx.x][threadIdx.y + j]; }



- a warp of threads reads contiguous data from idata into rows of the shared memory tile.
- After recalculating the array indices, a column of the shared memory tile is written to contiguous addresses in odata.
- Because threads write different data to odata than they read from idata, we must use a block-wise barrier synchronization \_\_syncthreads().



- There is a dramatic increase in effective bandwidth of the coalesced transpose over the naive transpose, but there still remains a large performance gap between the coalesced transpose and the copy:
  - One possible cause of this performance gap could be the synchronization barrier required in the coalesced transpose.
  - This can be easily assessed using the following copy kernel which utilizes shared memory and contains a \_\_syncthreads() call.

```
__global__ void copySharedMem(float *odata, const float *idata) {
    __shared__ float tile[TILE_DIM * TILE_DIM];
    int x = blockldx.x * TILE_DIM + threadIdx.x;
    int y = blockldx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x] = idata[(y+j)*width + x];
    __syncthreads();

for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x];
}
```

https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/

## Coalesced Transpose 12

| Effective Bandwidth (GB/s, ECC enabled) |             |            |  |
|-----------------------------------------|-------------|------------|--|
| Routine                                 | Tesla M2050 | Tesla K20c |  |
| copy                                    | 105.2       | 136.0      |  |
| copySharedMem                           | 104.6       | 152.3      |  |
| transposeNaive                          | 18.8        | 55.3       |  |
| transposeCoalesced                      | 51.3        | 97.6       |  |

 The shared memory copy results seem to suggest that the use of shared memory with a synchronization barrier has little effect on the performance.



#### Shared memory bank conflicts 1

 Shared memory is divided into 32 equally-sized memory modules, called banks, which are organized such that successive 32-bit words are assigned to successive banks.



 These banks can be accessed simultaneously, and to achieve maximum bandwidth to and from shared memory, the [threads in a warp should access shared memory associated with different banks].



#### Shared memory bank conflicts 2

- These banks can be accessed simultaneously, and to achieve maximum bandwidth to and from shared memory the threads in a warp should access shared memory associated with different banks.
- The exception to this rule is when all threads in a half warp read the same shared memory address, which results in a broadcast where the data at that address is sent to all threads of the half warp in one transaction.







#### Shared memory bank conflicts 6

- The coalesced transpose uses a 32×32 shared memory array of floats.
- For a shared memory tile of 32 × 32 elements, all elements in a column of data map to the same shared memory bank
  - Resulting in a worst-case scenario for memory bank conflicts:
     reading a column of data results in a 32-way bank conflict.
- A simple way to avoid this conflict is to pad the shared memory array by one column:

\_shared\_\_ float tile[TILE\_DIM][TILE\_DIM+1];







#### Granularity of Parallelism

- Size of a Tile?
  - We do test with a block of 32x8 threads with config. of "(32,8)"
  - What about 32x32?
    - "1024 threads wait at a barrier"
    - High Parallelism (?) but high synchronization overhead
  - What about 16x16?
    - "256 threads wait at a barrier"
    - Lower Parallelism (?) but lower synchronization overhead

#### Minimize timing waiting at a barrier!

https://github.com/jeonggunlee/cs344/blob/master/Lesson%20Code%20Snippets/L esson%205%20Code%20Snippets/transpose.cu



#### Conclusion - Transpose

- Understand CUDA performance characteristics
  - Memory coalescing
  - Bank conflicts
  - Granularity of parallelism
- Use peak performance metrics to guide optimization